Large Scale Citation Matching Using Apache Hadoop

Authors

  • Mateusz Fedoryszak
  • Dominika Tkaczyk
  • Lukasz Bolikowski
Abstract

Citation matching is the process of creating links from bibliography entries to the publications they reference. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.
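The abstract does not spell out the method, so the following is only a minimal sketch of the general idea (index-based candidate selection followed by pairwise scoring, arranged as map, shuffle and reduce steps), not the authors' implementation. It runs in plain Python rather than on Hadoop. The function names, the choice of tokens of five or more characters as index keys, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.6 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def tokens(text):
    # Assumed index keys: lowercased alphanumeric tokens of length >= 5.
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return {w for w in words if len(w) >= 5}


def map_phase(records, kind):
    # Mapper: emit one (key, value) pair per index token of each record.
    for record_id, text in records.items():
        for key in tokens(text):
            yield key, (kind, record_id, text)


def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reducer: within each key group, score every citation against every document.
    for values in groups.values():
        cits = [v for v in values if v[0] == "citation"]
        docs = [v for v in values if v[0] == "document"]
        for _, cid, ctext in cits:
            for _, did, dtext in docs:
                score = SequenceMatcher(None, ctext.lower(), dtext.lower()).ratio()
                yield cid, did, score


def match_citations(citations, documents, threshold=0.6):
    # Keep, for each citation, the best-scoring document at or above the threshold.
    pairs = list(map_phase(citations, "citation")) + list(map_phase(documents, "document"))
    best = {}
    for cid, did, score in reduce_phase(shuffle(pairs)):
        if score >= threshold and score > best.get(cid, (None, 0.0))[1]:
            best[cid] = (did, score)
    return {cid: did for cid, (did, _) in best.items()}
```

Grouping by shared tokens means a citation is compared only with documents that share at least one index key, which is what keeps the number of pairwise comparisons manageable. A pair sharing several tokens is scored more than once in this sketch; a real job would deduplicate candidate pairs before scoring.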

Similar resources

Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop

Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys including classification, categorization and citation matching of scientific publications. The size of the input data classifies these algorithms in the range of big data problems, which can be efficiently solved...

Matching Dispute Finder Claims to Wikipedia Articles

Dealing with large datasets is increasingly becoming a problem for natural language processing researchers. For our class project we investigate applying the open-source Hadoop MapReduce framework to the problem of information retrieval using TF-IDF.

ARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive

Large-scale data sets provide a better opportunity to discover data relationships in the area of business intelligence. In this paper, we implement our systems using Hadoop, which has become popular for storing and processing Big Data. However, it is not easy to write Hadoop MapReduce code. Therefore, we use Hive and HiveQL code to understand the relationships between ratings and the users...

Survey on Information Retrieval and Pattern Matching for Compressed Data Size using the SVD Technique on Real Audio Dataset

Due to the increasing size of text and audio data on the internet, various techniques are needed to help find and extract very specific information relevant to a user's task. Text mining is a variant of data mining that tries to discover curious patterns in large databases. Singular value decomposition is used for dimensionality reduction of large dat...

Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents

In this paper we discuss a new model for document clustering which has been adapted using non-negative matrix factorization method. The key idea is to cluster the documents after measuring the proximity of the documents with the extracted features. The extracted features are considered as the final cluster labels and clustering is done using cosine similarity which is equivalent to k-means with...

Publication date: 2013